Lab 07a: Decision tree classification

Introduction

This lab focuses on SMS message spam detection using decision tree and random forest classification. It's a direct counterpart to the rule-based spam detection from Lab 05. At the end of the lab, you should be able to use scikit-learn to:

Create a decision tree classification model and a random forest classification model.
Use the models to predict new values.
Measure the accuracy of the models.

Getting started

Let's start by importing the packages we'll need. As usual, we'll import pandas for exploratory analysis, but this week we're also going to use the tree subpackage from scikit-learn to create decision tree models and the ensemble subpackage to create random forest models. We'll also use the dummy package to build a baseline model from we which can gauge how good our final model is, just like we did in Lab 06.



In [ ]:

    
import pandas as pd

from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

Next, let's load the data. Write the path to your sms.csv file in the cell below:



In [ ]:

    
data_file = 'data/sms.csv'

Execute the cell below to load the CSV data into a pandas data frame with the columns label and message.

Note: This week, the CSV file is not comma separated, but instead tab separated. We can tell pandas about the different format using the sep argument, as shown in the cell below. For more information, see the read_csv documentation.



In [ ]:

    
sms = pd.read_csv(data_file, sep='\t', header=None, names=['label', 'message'])
sms.head()

Finally, let's select our feature matrix $X$ and target variable $y$ from the data:



In [ ]:

    
X = sms['message']
y = sms['label']

Dummy model

Before we build the decision tree model, let's build a dummy model, i.e. a naive model that predicts new values using a simple strategy. Dummy models are usually not good predictors (we usually won't use them to solve real problems), but are useful because they provide a baseline accuracy measurement for the data set, from which we can gauge the accuracy of any further models we build. In Lab 06, we built a dummy regression model as a baseline for our linear regression model.

scikit-learn provides dummy model functionality via the dummy subpackage. This subpackage provides both dummy regression and classification algorithms, which can be customised with different strategies. We can use the DummyClassifier class to build a dummy classification model. DummyClassifier supports five different strategies for predicting values:

strategy='stratified': Predict new values randomly, but in proportion to their frequency in the training set.
strategy='most_frequent': Predict new values as the most frequently occurring target variable in the training set.
strategy='prior': Predict new values as the most frequently occurring target variable in the training set. This is essentially the same as strategy='most_frequent', but returns different values when the predict_proba method is called.
strategy='uniform': Predict new values randomly, with equal probability.
strategy='constant': Predict new values as some constant value (the constant value must also be specified using the constant keyword argument).

Data transformation

Our data is in the form of raw text. This was fine when we were using the custom rule-based model in Lab 05, but this was a bit of a cheat because the model was written specially for that lab. In general, algorithms can only consume numerical values, and so we'll need to transform the data into a suitable representation. One popular way to do this with text data is to compute the TF-IDF score for each word in our dataset, just as we did in Lab 04b.

The TF-IDF score is composed of the term frequency (TF) and inverse document frequency (IDF) measures:

Term frequency is a measure of how often a given term appears in a given document, e.g. how often the word "free" appears in a given SMS message. The more often a word appears in a document, the higher its term frequency.
Inverse document frequency is a measure of how rare a word is in a set of documents, e.g. the word "the" appears commonly in many SMS messages and so its presence (or absence) provides little information when it comes to distinguishing spam from ham. The higher the inverse document frequency of a word, the rarer it is (and, therefore, the more distinguishing power it has).

Typically, term frequency and inverse document frequency are combined as a single metric, term frequency-inverse document frequency (TF-IDF), which is simply the multiple of the individual values. Consequently, if a term has a high TF-IDF score, its presence across a set of documents (e.g. SMS messages) is low, while its number of occurrences in a given document (e.g. a candidate SMS message under evaluation) is high. If a term has a low TF-IDF score, this is an indicator that it doesn't appear very frequently in a given document, occurs very frequently across the set of documents, or both. We can exploit this information to find terms that can distinguish a certain set of documents (e.g. spam) from a larger set of documents.

Pipelines

In Lab 04b, we used the TfidfVectorizer class to compute TF-IDF scores for the words in the messages in our SMS dataset. Now, we need to use its numerical output as the input to our dummy model. While we could save the output to a temporary variable, and feed this to the input of our dummy model, there is a convenient alternative.

In scikit-learn, pipelines represent chains of transformation operations ending in an estimator (i.e. a classifier or regressor). Once a pipeline has been created, we can treat it just like a model, calling the fit method to fit our data and the predict method to predict new values. The general structure of a pipeline looks like this:

+----------------+     +---------------+             +-----------+
| Transformer #1 +---->+ Transfomer #2 +---->...---->+ Estimator |
+----------------+     +---------------+             +-----------+

In this case, we want to build a pipeline consisting of a TD-IDF scorer followed by a dummy model. We can do this using the make_pipeline function from the pipeline subpackage of scikit-learn, as follows:



In [ ]:

    
pipeline = make_pipeline(
    TfidfVectorizer(stop_words='english'),                  # Remove English stop words before computing TF-IDF
    DummyClassifier(strategy='stratified', random_state=0)  # Use a stratified prediction strategy
)

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # Stratified 5 fold cross validation
y_pred = cross_val_predict(pipeline, X, y, cv=outer_cv)               # Make predictions via cross validation

print(classification_report(y, y_pred))  # Print the classification report

As can be seen, the dummy model results are poor - much worse than our rule-based model from Lab 05:

87% of the messages we labelled as ham were actually ham (precision for ham = 0.87).
16% of the messages we labelled as spam were actually spam (precision for spam = 0.16).
We labelled 85% of actual ham as ham (recall for ham = 0.85).
We labelled 17% of actual spam as spam (recall for spam = 0.17).

Still, the dummy model is useful as a "worst case" baseline to which we can compare our further analysis.

Decision tree classification

Let's start by building a decision tree classification model of the SMS message data. scikit-learn supports decision tree functionality via the tree subpackage. This subpackage supports both decision tree regression and classification. We can use the DecisionTreeClassifier class to build our model.

DecisionTreeClassifier accepts a number of different hyperparameters and the model we build may be more or less accurate depending on their values. We can get a list of these modelling parameters using the get_params method of the estimator (this works on any scikit-learn estimator), like this:



In [ ]:

    
DecisionTreeClassifier().get_params()

You can find a more detailed description of each parameter in the scikit-learn documentation.

Let's use a grid search to select the optimal decision tree classification model from a set of candidates. First, we need to build a pipeline, just as we did above. Next, we define the parameter grid. Finally, we use a grid search to select the best model via an inner cross validation and an outer cross validation to measure the accuracy of the selected model.

Note: When using grid search with pipelines, we have to adjust the names of our hyperparameters, prepending the name of the class they apply to (in lowercase). This is so that scikit-learn can distinguish which hyperparameters apply to what classes. Below, we prepend the string 'decisiontreeclassifier__' to each hyperparameter name because they all apply to the DecisionTreeClassifier class.



In [ ]:

    
pipeline = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    DecisionTreeClassifier(random_state=0) # Control randomness, so the lab works the same for everyone
)

# Build models for different values of criterion, min_samples_leaf and min_samples_split
parameters = {
    'decisiontreeclassifier__criterion': ['gini', 'entropy'],
    'decisiontreeclassifier__min_samples_leaf': [1, 10, 20],
    'decisiontreeclassifier__min_samples_split': [2, 10, 20]  # Min value is 2
}

# Use inner CV to select the best model
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # K = 5

clf = GridSearchCV(pipeline, parameters, cv=inner_cv, n_jobs=-1)  # n_jobs=-1 uses all available CPUs = faster
clf.fit(X, y)

# Use outer CV to evaluate the error of the best model
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # K = 10, doesn't have to be the same
y_pred = cross_val_predict(clf, X, y, cv=outer_cv)

print(classification_report(y, y_pred))  # Print the classification report

The model is much more accurate than both the dummy model above and the rule-based model from lab 05. Specifically, we can say that:

97% of the messages we labelled as ham were actually ham (precision for ham = 0.97).
91% of the messages we labelled as spam were actually spam (precision for spam = 0.91).
We labelled 99% of ham as ham (recall for ham = 0.99).
We labelled 87% of spam as spam (recall for spam = 0.87).

This is a significant increase in spam detection ability over both the dummy model and the rule-based model from Lab 05! Further improvements can be made by expanding the ranges of parameter grid values or introducing further hyperparameters (e.g. other stopping criteria).

After fitting a pipeline object, we can check the parameters of the final model as follows:



In [ ]:

    
clf.best_params_

It's also possible to plot a flow chart representation of a decision tree, although this is beyond the scope of this lab.

Random forest classification

Decision trees are useful, but tend to create models that are highly sensitive to slight changes in the values of their input features. One solution to this drawback is the use of random forests, which tend to blur the decision boundaries in regions where there is uncertainty as to which class is the correct choice.

Let's build a random forest classification model of the SMS message data to see if we can improve over the simple decision tree model. scikit-learn supports ensemble model functionality via the ensemble subpackage. This subpackage supports random forest regression and classification, as well as several other ensemble modelling algorithms. We can use the RandomForestClassifier class to build our model.

Just like earlier, we can list the hyperparameters of the algorithm by calling the get_params method on it:



In [ ]:

    
RandomForestClassifier().get_params()

As can be seen, we can choose to specify the same hyperparameters as with decision tree classifiers (plus a few more - see the official documentation for more detail). But let's keep it simple and just specify the number of subtrees to include in the forest (feel free to adjust or add to the grid search parameters, if you want to). We can do this using the n_estimators attribute, as follows:



In [ ]:

    
pipeline = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    RandomForestClassifier(random_state=0)  # Control randomness, so the lab works the same for everyone
)

# Build models for different numbers of estimators
parameters = {
    'randomforestclassifier__n_estimators': [5, 10, 25, 50]
}

# Use inner CV to select the best model
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # K = 5

clf = GridSearchCV(pipeline, parameters, cv=inner_cv, n_jobs=-1)  # n_jobs=-1 uses all available CPUs = faster
clf.fit(X, y)

# Use outer CV to evaluate the error of the best model
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # K = 10, doesn't have to be the same
y_pred = cross_val_predict(clf, X, y, cv=outer_cv)

print(classification_report(y, y_pred))  # Print the classification report

As can be seen, the random forest model is the most accurate spam detector that we've built so far:

98% of the messages we labelled as ham were actually ham (precision for ham = 0.98).
99% of the messages we labelled as spam were actually spam (precision for spam = 0.99).
We labelled every actual ham as ham (recall for ham = 1.00).
We labelled 84% of spam as spam (recall for spam = 0.84).

Further improvements can be made by expanding the ranges of parameter grid values or introducing further hyperparameters (e.g. impurity measures, stopping criteria).